Mark Dunning, Suraj Menon and Aiora Zabala. Original material by Robert Stojnić, Laurent Gatto, Rob Foy John Davey, Dávid Molnár and Ian Roberts
Last modified: 27 May 2016
Nature, Dec 2014
https://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919
http://www.revolutionanalytics.com/companies-using-r
Sidney Harris - New York Times
New York Times, July 2011
R)To launch RStudio, find the RStudio icon in the menu bar on the left of the screen and click
RStudio screenshot
2 + 2
20/5 - sqrt(25) + 3^2
sin(pi/2)Note: The number in the square brackets is an indicator of the position in the output. In this case the output is a ‘vector’ of length 1 (i.e. a single number). More on vectors coming up…
<-x <- 10
x
myNumber <- 25
myNumbersqrt(myNumber)x + myNumberx <- 21
xx <- myNumber
xmyNumber <- myNumber + sqrt(16)
myNumbersin(x)sum(3,4,5,6)
max(3,4,5,6)
min(3,4,5,6)seq(from = 2, to = 20, by = 4)
seq(2, 20, 4)c combines its arguments into a vector:x <- c(3,4,5,6)
x[] indicate the position within the vector (the index).[] notation:x[1]
x[4]y <- c(2,3)
x[y]x <- c(3, 4, 5, 6, 7, 8, 9, 10, 11, 12)x <- 3:12
xseq() function, which returns a vector:x <- seq(2, 20, 4)
xx <- seq(2, 20, length.out=5)
xrep() function:y <- rep(3, 5)
yy <- rep(1:3, 5)
yx <- 3:12
# Extract elements from x:
x[3:7]
x[seq(2, 6, 2)]
x[rep(3, 2)]y <- c(x, 1)
yz <- c(x, y)
zx <- 3:12
x[-3]
x[-(5:7)]
x[-seq(2, 6, 2)]x[6] <- 4
x
x[3:5] <- 1
xRemember!
x <- 1:10
y <- x*2yz <- x^2zy + zx + 1:2x + 1:3Warning in x + 1:3: longer object length is not a
multiple of shorter object length
[1] 2 4 6 5 7 9 8 10 12 11
gene.names <- c("Pax6", "Beta-actin", "FoxP2", "Hox9")
gene.names[1] "Pax6" "Beta-actin" "FoxP2"
[4] "Hox9"
names() function, which can be useful to keep track of the meaning of our data:gene.expression <- c(0, 3.2, 1.2, -2)
names(gene.expression) <- gene.names
gene.expression Pax6 Beta-actin FoxP2 Hox9
0.0 3.2 1.2 -2.0
names() function to get a vector of the names of an object:names(gene.expression)| Species | Genome size (Mb) | Protein coding genes |
|---|---|---|
| Homo sapiens | 3,102 | 20,774 |
| Mus musculus | 2,731 | 23,139 |
| Drosophila melanogaster | 169 | 13,937 |
| Caenorhabditis elegans | 100 | 20,532 |
| Saccharomyces cerevisiae | 12 | 6,692 |
c function. Create a species.name vector and use this vector to name the values in the other two vectors.coding.bases.coding.bases and genome.size vectors to calculate this. (See earlier slides for how to do division in R.)genome.size <- c(3102, 2731, 169, 100, 12)
coding.genes <- c(20774, 23139, 13937, 20532, 6692)
species.name <- c("H. sapiens","M. musculus",
"D. melanogaster","C. elegans",
"S. cerevisiae")
names(genome.size) <- species.name
names(coding.genes) <- species.namecoding.bases <- coding.genes*0.0015
coding.bases H. sapiens M. musculus D. melanogaster
31.1610 34.7085 20.9055
C. elegans S. cerevisiae
30.7980 10.0380
coding.pc <- coding.bases/genome.size*100
coding.pc H. sapiens M. musculus D. melanogaster
1.004545 1.270908 12.370118
C. elegans S. cerevisiae
30.798000 83.650000
coding.bases[1]/coding.bases[5]H. sapiens
3.104304
genome.size[1]/genome.size[5]H. sapiens
258.5
coding.pc) but sometimes it is not (when we are comparing human to yeast). We can remove names by setting them to the special NULL value:names(coding.pc) <- NULL
coding.pc[1] 1.004545 1.270908 12.370118 30.798000
[5] 83.650000
Typing lots of commands directly to R can be tedious. A better way is to write the commands to a file and then load it into R.
md-format
solution-exercise1.Rmd for solution to Exercise 1? followed by the function name. For example:?seqexample function:example(seq)?? followed by your guess. R will return a list of possibilities:??plot; end of line (Enables multiple commands to be placed on one line of text)# comment (indicates text is a comment and not executed)+ command line wrap (R is waiting for you to complete an expression)sum() is in the base package and sd(), which calculates the standard deviation of a vector, is in the stats packageCRAN packages can be installed using install.packages()
install.packages(name.of.my.package)source("http://bioconductor.org/biocLite.R")biocLite() function:biocLite("PackageName")install.packages() function to install it:install.packages("ggplot2")DESeq is a Bioconductor package (http://www.bioconductor.org) for the analysis of RNA-seq data:source("http://www.bioconductor.org/biocLite.R")
biocLite("DESeq")library(...) function to load the newly installed features:library(ggplot2) # loads ggplot functions
library(DESeq) # loads DESeq functions
library() # Lists all the packages
# you've got installed | First_Name | Second_Name | Full_Name | Sex | Age | Weight | Consent |
|---|---|---|---|---|---|---|
| Adam | Jones | Adam Jones | Male | 50 | 70.8 | TRUE |
| Eve | Parker | Eve Parker | Female | 21 | 67.9 | TRUE |
| John | Evans | John Evans | Male | 35 | 75.3 | FALSE |
| Mary | Davis | Mary Davis | Female | 45 | 61.9 | TRUE |
| Peter | Baker | Peter Baker | Male | 28 | 72.4 | FALSE |
| Paul | Daniels | Paul Daniels | Male | 31 | 69.9 | FALSE |
| Joanna | Edwards | Joanna Edwards | Female | 42 | 63.5 | FALSE |
| Matthew | Smith | Matthew Smith | Male | 33 | 71.5 | TRUE |
| David | Roberts | David Roberts | Male | 57 | 73.2 | FALSE |
| Sally | Wilson | Sally Wilson | Female | 62 | 64.8 | TRUE |
age <- c(50, 21, 35, 45, 28, 31, 42, 33, 57, 62)
weight <- c(70.8, 67.9, 75.3, 61.9, 72.4, 69.9,
63.5, 71.5, 73.2, 64.8)firstName <- c("Adam", "Eve", "John", "Mary",
"Peter", "Paul", "Joanna", "Matthew",
"David", "Sally")
secondName <- c("Jones", "Parker", "Evans", "Davis",
"Baker","Daniels", "Edwards", "Smith",
"Roberts", "Wilson")TRUE and FALSE:consent <- c(TRUE, TRUE, FALSE, TRUE, FALSE,
FALSE, FALSE, TRUE, FALSE, TRUE)c(20, "a string", TRUE)[1] "20" "a string" "TRUE"
class() function: class(firstName)
class(age)
class(weight)
class(consent)sex <- c("Male", "Female", "Male", "Female", "Male",
"Male", "Female", "Male", "Male", "Female")
sex [1] "Male" "Female" "Male" "Female" "Male"
[6] "Male" "Female" "Male" "Male" "Female"
factor(sex) [1] Male Female Male Female Male Male
[7] Female Male Male Female
Levels: Female Male
paste() function joins character vectors together)patients <- data.frame(firstName, secondName,
paste(firstName, secondName),
sex, age, weight, consent)patients firstName secondName paste.firstName..secondName. ...
1 Adam Jones Adam Jones
2 Eve Parker Eve Parker
3 John Evans John Evans
4 Mary Davis Mary Davis
5 Peter Baker Peter Baker
6 Paul Daniels Paul Daniels
7 ...
$’ operator:patients$agepaste() command)names() function, and we can use the same function to see the names:names(patients) <- c("First_Name", "Second_Name",
"Full_Name", "Sex", "Age",
"Weight", "Consent")names(patients)9cfbfd4738f27e3783f35d4140c28414a80b6c75 ##Naming data frame variables
patients <- data.frame(First_Name = firstName,
Second_Name = secondName,
Full_Name = paste(firstName, secondName),
Full_Name = paste(firstName,
secondName),
Sex = sex,
Age = age,
Weight = weight,
Consent = consent)names(patients)patients$First_Namefactor():patients <- data.frame(First_Name = firstName,
Second_Name = secondName,
Full_Name = paste(firstName,
secondName),
Sex = factor(sex),
Age = age,
Weight = weight,
Consent = consent,
stringsAsFactors = FALSE)patients$Sex
patients$First_NameAll columns are assumed to contain the same data type, e.g. numerical
e <- matrix(1:10, nrow=5, ncol=2)
e [,1] [,2]
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9
[5,] 5 10
rowMeans(e)[1] 3.5 4.5 5.5 6.5 7.5
object[rows, colums]e[1,2]
e[1,]
patients[1,2]
patients[1,]s <- letters[1:5]
s[1] "a" "b" "c" "d" "e"
# View some of the values in s:s[c(1,3)]
s[c(TRUE, FALSE, TRUE, FALSE, FALSE)]TRUE and FALSE values to subset the vectora <- 1:5
# Logical tests:
a < 3[1] TRUE TRUE FALSE FALSE FALSE
s[a < 3][1] "a" "b"
<, >, <=, >=, ==, !=!, &, |, xor
TRUE, FALSE)s[1] "a" "b" "c" "d" "e"
a[1] 1 2 3 4 5
s[a > 1 & a <3]
s[a == 2]country, continent, and height
country a character vector but continent a factorsummary() function on your data frame. What does it do? How does it treat vectors (numeric, character, logical) and factors? (What does it do for matrices?)patients[patients$Age < 40, ]patients[patients$Consent == TRUE, ]patients[patients$Sex=="Male" & patients$Weight>=70.8, ]read.table()read.csv(), read.delim()write.table()write.csv()wd)setwd("/home/participant/Course_Materials")Before we even start the analysis, we need to be sure of where the data are located on our hard drive
getwd()list.files()file.choose()rawdata)rawData <- read.delim("countData.txt")file.choose():myfile <- file.choose()
rawData <- read.delim(myfile)sep="," or the function read.csv():read.csv("countData.csv")?read.tableAlways check the object to make sure the contents and dimensions are as you expect
# View the first 10 rows to ensure import is OK
rawData[1:10,] View() function to get a display of the data in RStudio:View(rawData)Once we have read the data successfully, we can start to interact with it
The object we have created is a data frame:
class(rawData)[1] "data.frame"
ncol(rawData)
nrow(rawData)
dim(rawData)str(rawData)'data.frame': 50 obs. of 5 variables:
$ Patient: int 1 2 3 4 5 6 7 8 9 10 ...
$ Nuclei : int 65 51 37 37 45 46 65 59 49 46 ...
$ NB_Amp : int 0 3 2 2 2 4 1 1 0 0 ...
$ NB_Nor : int 63 43 35 35 42 41 64 54 48 45 ...
$ NB_Del : int 2 5 0 0 1 1 0 4 1 1 ...
colnames(rawData)[1] "Patient" "Nuclei" "NB_Amp" "NB_Nor" "NB_Del"
rawData$NucleiLike families, tidy datasets are all alike but every messy dataset is messy in its own way - (Hadley Wickham - RStudio chief scientist and author of dplyr, ggplot2 and others)
NA values, which means the values are missing – a common occurrence in real data collectionNA is a special value that can be present in objects of any type (logical, character, numeric etc)NA is not the same as NULL:
NULL is an empty R object.NA is one missing value within an R object (like a data frame or a vector)NAs gracefully:x <- c(1, NA, 3)
length(x)NAs, and functions often have their own arguments (like na.rm) for handling them:mean(x, na.rm = TRUE)
mean(na.omit(x))which() function to select indices from a logical vector that are TRUE# Create an index of results:
prop <- rawData$NB_Amp / rawData$Nucleiprop > 0.33# Get sample names of amplified patients:
amp <- which(prop > 0.33) ampplot(prop, ylim=c(0,1))
# Add a horizonal line:
abline(h=0.33) write.csv(rawData[amp,], file="selectedSamples.csv")getwd() # print working directory
list.files() # list files in working directory(NB_Amp / Nuclei < 0.33 & NB_Del == 0)norm <- which(prop < 0.33 & rawData$NB_Del == 0)
norm [1] 3 4 7 15 20 24 36 37 42 47
write.csv(rawData[norm,], "My_NB_output.csv")lattice, ggplot2plot() will make a scatter plot with the values of the vector on the y axis, and indices in the x axis
patients$Weight [1] 70.8 67.9 75.3 61.9 72.4 69.9 63.5 71.5 73.2 64.8
plot(patients$Weight)plot():
patients$Age [1] 50 21 35 45 28 31 42 33 57 62
plot(patients$Age, patients$Weight)plot() functionbarplot()barplot(patients$Age)barplot(summary(patients$Sex))hist(patients$Weight)boxplot(patients$Weight, horizontal = TRUE)y ~ x means put continuous variable y on the y axis and categorical x on the x axisboxplot(patients$Weight ~ patients$Sex, horizontal = T)example(dotchart)example(stripchart)example(vioplot) # From vioplot libraryexample(beeswarm) # From beeswarm libraryozone.csv:
weather <- read.csv("ozone.csv")
View(weather)| Ozone | Solar.R | Wind | Temp | Month | Day |
|---|---|---|---|---|---|
| 41 | 190 | 7.4 | 67 | 5 | 1 |
| 36 | 118 | 8.0 | 72 | 5 | 2 |
| 12 | 149 | 12.6 | 74 | 5 | 3 |
| 18 | 313 | 11.5 | 62 | 5 | 4 |
| NA | NA | 14.3 | 56 | 5 | 5 |
| 28 | NA | 14.9 | 66 | 5 | 6 |
plot(weather$Solar.R, weather$Ozone)hist(weather$Temp)boxplot(weather$Ozone)boxplot(weather$Ozone ~ weather$Month)plot() comes with a large collection of arguments that can be set when we call the function:
?plot and ?parplot(patients$Weight, type = "l")plot(patients$Weight, type = "b")main argument:plot(patients$Age, patients$Weight,
main = "Relationship between Weight and Age")plot(patients$Age, patients$Weight, xlab = "Age")plot(patients$Age, patients$Weight, ylab = "Weight")ylim and xlim are used to specify axis limitsplot(patients$Age,patients$Weight,
ylab="Weight",
xlab="Age",
main="Relationship between Weight and Age",
xlim=c(10,70),
ylim=c(60,80))"red", "orange","green","blue","yellow"…firebrick, gray25, mediumseagreen, cornsilk2, grey26, purple2, grey73, darkorange4…
colours()Can also use Red Green Blue and hexadecimal values:
rgb(0.7, 0.7, 0.7) → A light grey in RGB format`"#B3B3B3" → The same light grey in hexadecimal"#0000FF88"→ A semi-transparent blue, in hexadecimal
Changing the col argument to plot() changes the colour that the points are plotted in:
plot(patients$Age, patients$Weight, col = "red")plot(patients$Age, patients$Weight, pch = 16)Warning in plot.xy(xy.coords(x, y), type = type, ...):
unimplemented pch value '26'
plot(patients$Age, patients$Weight, pch = "X")Character expansion:
plot(patients$Age, patients$Weight, cex = 3)Character expansion:
plot(patients$Age, patients$Weight, cex = 0.2)plot(patients$Age, patients$Weight,
pch = 1:10, cex = 1:5,
col = c("red", "orange", "green", "blue"))plot()
boxplot(patients$Weight~patients$Sex,
xlab = "Sex",
ylab = "Weight",
main = "Relationship between Weight and Gender",
col = c("blue","yellow"))Can you re-create the following plots? Hint:
breaks and freq arguments to hist (?hist) to create 50 bins and display density rather than frequency?rainbow)plot(weather$Solar.R, weather$Ozone, col="orange", pch=16,
ylab="Ozone level", xlab="Solar Radiation",
main="Relationship between ozone level and
solar radiation")hist(weather$Temp, col="purple", xlab="Temperature",
main="Distribution of Temperature", breaks = 50:100,
freq=FALSE)rainbow() function is used to create a vector of colours for the boxplot; in other words a palette:
heat.colors(), terrain.colors(), topo.colors(), cm.colors()boxplot(weather$Ozone ~ weather$Month, col=rainbow(5))RColorBrewer package:library(RColorBrewer)
display.brewer.all()boxplot(weather$Ozone ~ weather$Month,
col=brewer.pal(5,"Set1"))